In [1]:
import graphlab
In [2]:
people = graphlab.SFrame('people_wiki.gl/')
The data contains a link to each Wikipedia article, the name of the person, and the text of the article.
In [3]:
people.head()
Out[3]:
In [4]:
len(people)
Out[4]:
In [5]:
obama = people[people['name'] == 'Barack Obama']
In [6]:
obama
Out[6]:
In [7]:
obama['text']
Out[7]:
In [8]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']
Out[8]:
In [9]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
In [10]:
print(obama['word_count'])
In [11]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])
In [12]:
obama_word_count_table.head()
Out[12]:
In [13]:
obama_word_count_table.sort('count',ascending=False)
Out[13]:
The most common words include uninformative ones such as "the", "in", and "and".
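To make the counting step concrete, here is a minimal plain-Python sketch of building and sorting a word-count table. It does not call GraphLab: the helper name `count_words`, the whitespace tokenizer, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def count_words(text):
    """Map each lowercase token to the number of times it occurs.

    Assumption: tokens are separated by whitespace and punctuation is
    not stripped, which is simpler than a real text-analytics tokenizer.
    """
    return Counter(text.lower().split())


# Hypothetical sample text, not taken from the dataset.
sample = "the senator and the president met in the capital and in the senate"

counts = count_words(sample)

# Turn the dictionary into (word, count) rows and sort by count,
# highest first; ties are broken alphabetically.
table = sorted(counts.items(), key=lambda row: (-row[1], row[0]))

for word, count in table[:3]:
    print("%s %d" % (word, count))
```

Even in this tiny example the top of the table is dominated by function words, which is the same pattern the sorted table above shows for a full article.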
In [14]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()
Out[14]:
In [15]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
tfidf
Out[15]:
In [16]:
people['tfidf'] = tfidf['docs']
In [17]:
obama = people[people['name'] == 'Barack Obama']
In [18]:
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)
Out[18]:
The words with the highest TF-IDF are much more informative.
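The effect can be reproduced on a toy corpus. The sketch below assumes one common TF-IDF formulation, the word count multiplied by log(N / number of documents containing the word); the variant a given library uses may differ, and the documents and the helper name `tf_idf` are invented for illustration.

```python
import math


def tf_idf(docs):
    """Compute TF-IDF weights for a list of word-count dictionaries.

    Assumption: weight = count * log(N / document_frequency), where N is
    the number of documents. Library implementations may use a variant.
    """
    n_docs = len(docs)
    doc_freq = {}
    for counts in docs:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in counts.items()}
        for counts in docs
    ]


# Hypothetical word counts for three tiny documents.
docs = [
    {"the": 4, "president": 2, "senate": 1},
    {"the": 3, "singer": 2, "album": 1},
    {"the": 5, "footballer": 2, "club": 1},
]

weights = tf_idf(docs)

# The word with the largest weight in the first document.
top_word = max(weights[0], key=weights[0].get)
print(top_word)
```

Because "the" occurs in every document, its weight is log(1) = 0 no matter how often it appears, while words confined to one document keep a positive weight.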
In [19]:
clinton = people[people['name'] == 'Bill Clinton']
In [20]:
beckham = people[people['name'] == 'David Beckham']
In [21]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])
Out[21]:
In [22]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])
Out[22]:
In [23]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')
In [24]:
knn_model.query(obama)
Out[24]:
As we can see, President Obama's article is closest to the one about his vice president, Biden, and to those of other politicians.
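Conceptually, a nearest neighbors query ranks every article by its distance to the query article. The following brute-force sketch shows that idea in plain Python. It is not how GraphLab implements the model, the Euclidean distance is chosen only for illustration, and the labels, weights, and the helper names `euclidean` and `query` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def query(corpus, label, k=3):
    """Return the k (label, distance) pairs closest to `label`.

    `corpus` maps a label to its sparse feature dict. The query item is
    part of the corpus, so it is ranked first at distance 0.
    """
    target = corpus[label]
    ranked = sorted((euclidean(target, vec), name) for name, vec in corpus.items())
    return [(name, dist) for dist, name in ranked[:k]]


# Hypothetical feature weights, invented for illustration.
corpus = {
    "politician_a": {"president": 2.0, "senate": 1.0},
    "politician_b": {"president": 1.5, "senate": 1.0, "governor": 0.5},
    "singer_a": {"album": 2.0, "tour": 1.0},
}

neighbors = query(corpus, "politician_a", k=2)
print(neighbors)
```

Articles that share heavily weighted words end up close together, which is why one politician's nearest neighbors tend to be other politicians.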
In [25]:
swift = people[people['name'] == 'Taylor Swift']
In [26]:
knn_model.query(swift)
Out[26]:
In [27]:
jolie = people[people['name'] == 'Angelina Jolie']
In [28]:
knn_model.query(jolie)
Out[28]:
In [29]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']
In [30]:
knn_model.query(arnold)
Out[30]:
In the notebook covered in this module, we explored two document representations: word counts and TF-IDF. Now, take a particular famous person, 'Elton John'. What are the 3 words in his article with the highest word counts? What are the 3 words in his article with the highest TF-IDF? These results illustrate why TF-IDF is useful for finding important words. Save these results to answer the quiz at the end.
In [38]:
people.head(2)
Out[38]:
In [41]:
elton_john = people[people['name'] == 'Elton John']
In [42]:
elton_john.head()
Out[42]:
In [46]:
elton_john[['word_count']].stack('word_count', new_column_name = ['word','count']).sort('count',ascending=False)
Out[46]:
In [48]:
elton_john[['tfidf']].stack('tfidf', new_column_name = ['word','tfidf']).sort('tfidf',ascending=False)
Out[48]:
Elton John is a famous singer; let’s compute the distance between his article and those of two other famous singers. In this assignment, you will use the cosine distance, which is one measure of how similar two vectors are (a smaller distance means more similar), similar to the one discussed in the lectures. You can compute this distance using the graphlab.distances.cosine function. What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’? Which one of the two is closest to Elton John? Does this result make sense to you?
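For reference, cosine distance is usually defined as 1 minus the cosine similarity of two vectors. The sketch below computes it for sparse vectors stored as dictionaries. It is a plain-Python illustration rather than the library function, and the vectors and the helper name `cosine_distance` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between sparse dict vectors.

    Assumption: both vectors have at least one non-zero entry, so the
    norms in the denominator are never zero.
    """
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical TF-IDF-style weights, invented for illustration.
singer_a = {"album": 3.0, "piano": 4.0}
singer_b = {"album": 4.0, "piano": 3.0}
athlete = {"club": 5.0}

print(cosine_distance(singer_a, singer_b))
print(cosine_distance(singer_a, athlete))
```

Vectors with no words in common have a dot product of 0 and therefore a distance of 1, while vectors pointing in nearly the same direction have a distance near 0. The distance depends on direction and not on vector length, so a long article and a short one can still be close.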
In [49]:
victoria_beckham = people[people['name'] == 'Victoria Beckham']
In [50]:
paul_mccartney = people[people['name'] == 'Paul McCartney']
In [51]:
graphlab.distances.cosine(elton_john['tfidf'][0],victoria_beckham['tfidf'][0])
Out[51]:
In [52]:
graphlab.distances.cosine(elton_john['tfidf'][0], paul_mccartney['tfidf'][0])
Out[52]:
In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and the default settings of the nearest neighbors model. Now, you will build two nearest neighbors models: one using word counts as features and one using TF-IDF as features.
In both of these models, we are going to set the distance function to cosine distance.
Here is how: when you call the function graphlab.nearest_neighbors.create, add the parameter distance='cosine'.
In [53]:
word_count_cosine_model = graphlab.nearest_neighbors.create(people,features=['word_count'],label='name',distance='cosine')
In [54]:
tfidf_cosine_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name',distance='cosine')
In [56]:
word_count_cosine_model.query(elton_john)
Out[56]:
In [57]:
tfidf_cosine_model.query(elton_john)
Out[57]:
In [59]:
word_count_cosine_model.query(victoria_beckham)
Out[59]:
In [60]:
tfidf_cosine_model.query(victoria_beckham)
Out[60]: